* This project is an EDA that I submitted for the 10 Academy data science fellowship pre-interview examination, with some modifications from the original submission.
A football game generates many events, and it is both important and interesting to take into account the context in which those events were generated. This dataset should keep sports analytics enthusiasts awake for long hours, as the number of questions that can be asked is huge.
Read these blogs to get a good understanding of soccer/football stats.
Data description:
Nearly 25,000 soccer games from leagues all over the world. The fields in the data set are: Columns A to E contain information about the league, home and away teams, date, etc. Columns F, G and H contain the odds for a home win, draw and away win. Columns I to BQ contain the team statistics.
Home-team stats are prefixed with an "h"; similarly, away-team stats are prefixed with an "a". Examples include ladder position (a term for a team's rank in a group), games played, goals conceded, away games won, etc. Columns BR to CA contain final result information: the result, the full-time result and, if available, the half-time score as well.
The dataset ranges from January 2016 to October 2017, and the statistics have been sourced from a few different websites. Odds come from Bet365, and the results have been manually entered from http://www.soccerstats.com
Get more insight about the columns in the data by hovering over the column names on the original dataset page.
Data Location:
Exploratory analysis:
## required packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
class SoccerDataset(object):
    def __init__(self):
        self.DATA_DIR = 'data/SoccerData.xlsx'
        self.sheet = 'All Data'
        self.data = self.load_data()

    # load data from the 'All Data' sheet
    def load_data(self):
        data = pd.read_excel(self.DATA_DIR, sheet_name=self.sheet, index_col=None)
        return data

    ## get list of column names
    def getColumns(self):
        return self.data.columns

    ## show non-null value count and data type per column
    def getDataInfo(self):
        return self.data.info()
#Loading Data
SoccerDataset=SoccerDataset()
data=SoccerDataset.data
## get columns
SoccerDataset.getColumns()
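Given the naming convention described above (home-team stats prefixed with "h", away-team stats with "a"), the column list can be split into home and away groups by prefix. A minimal sketch with hypothetical column names (the real dataset's names may differ):

```python
import pandas as pd

# Toy frame with hypothetical column names following the h_/a_ prefix convention
df = pd.DataFrame(columns=['league', 'home_team', 'away_team',
                           'h_ladder', 'h_goals_conceded',
                           'a_ladder', 'a_goals_conceded'])

home_stats = [c for c in df.columns if c.startswith('h_')]
away_stats = [c for c in df.columns if c.startswith('a_')]
print(home_stats)  # ['h_ladder', 'h_goals_conceded']
print(away_stats)  # ['a_ladder', 'a_goals_conceded']
```

Checking for the `h_`/`a_` prefix (rather than a bare `h`/`a`) avoids accidentally matching columns like `home_team` or `away_team`.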
fig, ax = plt.subplots(figsize=(15, 5))
data.groupby('league').count()[['home_team']].plot.bar(
    ax=ax,
    title="Number of games per league")
plt.show()
## get info about data
SoccerDataset.getDataInfo()
## To check the distribution of each categorical feature we should cast it to an appropriate dtype.
## Here I encode all categorical features as integer category codes, except the RESULT column,
## because later I will replace the values of the target column with fixed labels for WIN, LOSS and DRAW.
data['league'] =data['league'].astype('category').cat.codes
data['home_team'] =data['home_team'].astype('category').cat.codes
data['away_team'] =data['away_team'].astype('category').cat.codes
data['date'] =data['date'].astype('category').cat.codes
## Map the result to the home team's perspective to explore the chance of winning at home vs. away.
data['RESULT'] = data['RESULT'].replace({'HOME': 'WIN', 'AWAY': 'LOSS'})  # DRAW stays DRAW
data['RESULT'].head()
## Note: after the replace above, the original HOME/AWAY values are gone, so replacing
## 'HOME' or 'AWAY' again would be a no-op. To get the away team's perspective,
## swap the WIN/LOSS labels on a copy of the column instead.
away_result = data['RESULT'].replace({'WIN': 'LOSS', 'LOSS': 'WIN'})
away_result.head()
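To see why re-replacing the result labels after the first mapping is a no-op, here is a toy example (values mirror the RESULT labels; a dict passed to `replace` swaps values simultaneously, not sequentially):

```python
import pandas as pd

res = pd.Series(['HOME', 'AWAY', 'DRAW', 'HOME'])

# Home-team perspective
home_view = res.replace({'HOME': 'WIN', 'AWAY': 'LOSS'})
print(home_view.tolist())  # ['WIN', 'LOSS', 'DRAW', 'WIN']

# Replacing 'HOME' again on home_view does nothing: no 'HOME' values remain
noop = home_view.replace({'HOME': 'LOSS', 'AWAY': 'WIN'})
print(noop.equals(home_view))  # True

# Away perspective: swap the already-mapped labels (dict replace is simultaneous)
away_view = home_view.replace({'WIN': 'LOSS', 'LOSS': 'WIN'})
print(away_view.tolist())  # ['LOSS', 'WIN', 'DRAW', 'LOSS']
```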
data.head()
Relationship of playing at Home or Away with Result
sns.set(rc={'figure.figsize': (10, 300)})
ax = sns.countplot(y='home_team', hue='RESULT', data=data)
for p in ax.patches:
    # the bars are horizontal (y=...), so the count is the bar's width, not its height
    width = p.get_width()
    if np.isnan(width):
        width = 0
    ax.annotate('{}'.format(int(width)), (width + 0.5, p.get_y() + p.get_height() / 2))
plt.title("Teams played at Home vs. chance of Win/Loss/Draw")
plt.show()
In the plot above, the classes are shown from the home team's perspective: WIN means the home team won the game.
sns.set(rc={'figure.figsize': (10, 300)})
ax = sns.countplot(y='away_team', hue='RESULT', data=data)
for p in ax.patches:
    # horizontal bars again: the count is the bar's width
    width = p.get_width()
    if np.isnan(width):
        width = 0
    ax.annotate('{}'.format(int(width)), (width + 0.5, p.get_y() + p.get_height() / 2))
plt.title("Teams played Away vs. chance of Win/Loss/Draw")
plt.show()
From the two plots above, we can see that teams playing at home have a noticeably better chance of winning than teams playing away.
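Rather than reading the home advantage off the bar plots alone, it can be quantified directly with `value_counts(normalize=True)`. A toy sketch (hypothetical results from the home team's perspective, not the real data):

```python
import pandas as pd

# Hypothetical results from the home team's perspective
results = pd.Series(['WIN', 'WIN', 'LOSS', 'DRAW', 'WIN', 'LOSS', 'WIN', 'DRAW'])
rates = results.value_counts(normalize=True)
print(rates['WIN'])  # 0.5 -> the home team won half of these toy games
```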
_=data.hist(figsize=(16, 20), bins=50, xlabelsize=8, ylabelsize=8)
As we can see from the distribution plots above, most features within the same group share a similar distribution, for example pa_ladder 1-5 with ph_ladder 1-5. So we could consider merging those features.
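One way to merge such paired home/away features (the column names below are hypothetical, following the h/a prefix convention) is to replace each pair with its difference, which keeps the home-vs-away contrast in a single column:

```python
import pandas as pd

df = pd.DataFrame({'h_ladder': [1, 3, 5], 'a_ladder': [2, 2, 1]})
df['ladder_diff'] = df['h_ladder'] - df['a_ladder']   # home rank minus away rank
df = df.drop(columns=['h_ladder', 'a_ladder'])
print(df['ladder_diff'].tolist())  # [-1, 1, 4]
```

Whether a difference, a ratio, or keeping both columns works best is an empirical question worth checking against the target.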
# Pearson correlation between the bookmaker odds and the result.
# RESULT holds string labels, so encode it numerically first; otherwise .corr() drops it.
odds = data[['home_odd', 'draw_odd', 'away_odd']].copy()
odds['RESULT'] = data['RESULT'].map({'WIN': 1, 'DRAW': 0, 'LOSS': -1})
plt.figure(figsize=(5, 5))
cor = odds.corr()
sns.heatmap(cor, annot=True, cmap=plt.cm.Reds)
plt.show()
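A minimal, self-contained illustration of why the string target needs numeric encoding before `.corr()` (toy numbers, not the real odds):

```python
import pandas as pd

toy = pd.DataFrame({
    'home_odd': [1.5, 2.0, 3.5, 1.2],
    'RESULT':   ['WIN', 'WIN', 'LOSS', 'WIN'],
})
# Encode the string labels; .corr() only works on numeric columns
toy['RESULT_NUM'] = toy['RESULT'].map({'WIN': 1, 'DRAW': 0, 'LOSS': -1})
cor = toy[['home_odd', 'RESULT_NUM']].corr()
print(cor.loc['home_odd', 'RESULT_NUM'] < 0)  # True: higher home odds, fewer home wins
```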
# Teams that played more than 10 home games
# (note: home_team counts only home appearances, not total games played)
played_teams_count = data['home_team'].value_counts()
most_played = played_teams_count[played_teams_count > 10]
soccer_data = pd.DataFrame({'Team': most_played.index, 'GameCount': most_played.values})
soccer_data.tail(20)  # check it only includes counts > 10; value_counts sorts in descending order
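Since `home_team` only counts home appearances, the total number of games per team needs the away counts added as well. A sketch with toy team codes:

```python
import pandas as pd

home = pd.Series([1, 2, 1, 3])  # home_team per game (toy codes)
away = pd.Series([2, 1, 3, 1])  # away_team per game

# Add the two count Series; fill_value=0 handles teams missing from one side
total_games = home.value_counts().add(away.value_counts(), fill_value=0).astype(int)
print(total_games.sort_index().tolist())  # team 1: 4, team 2: 2, team 3: 2 -> [4, 2, 2]
```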
corr = data.corr()
fig = plt.figure(figsize=(25, 25))
ax = fig.add_subplot(111)
cax = ax.matshow(corr, cmap='coolwarm', vmin=-1, vmax=1)
fig.colorbar(cax)
# .corr() keeps only numeric columns, so take the tick labels from corr itself,
# not from data.columns, or the labels will be misaligned with the matrix
ticks = np.arange(len(corr.columns))
ax.set_xticks(ticks)
plt.xticks(rotation=90)
ax.set_yticks(ticks)
ax.set_xticklabels(corr.columns)
ax.set_yticklabels(corr.columns)
plt.show()
In the correlation matrix above, gray indicates little or no correlation between two variables (values at or near 0); the darkest red means a perfect positive correlation, while the darkest blue means a perfect negative correlation. The matrix gives us a useful basis for deciding which features to drop or retain and which features have the greatest impact. Before deciding, though, we should resample the data to minimize the influence of outliers.
target_count = data.RESULT.value_counts()
for label, count in target_count.items():
    print(label + ':', count)
fig, ax = plt.subplots(figsize=(5, 5))
ax.tick_params(axis='x', labelsize=15)
ax.tick_params(axis='y', labelsize=10)
ax.set_xlabel('Result', fontsize=15)
ax.set_ylabel('Class Count' , fontsize=15)
ax.set_title('Distribution of target classes (WIN/LOSS/DRAW)', fontsize=15, fontweight='bold')
_=target_count.plot(ax=ax, kind='bar')
## As we can see from the result, the class distribution for WIN, LOSS and DRAW is relatively imbalanced, so I have to consider
## a balancing mechanism appropriate to the problem. Let's use the ... approach because of
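One simple balancing mechanism, sketched here as one possible option rather than the approach the analysis commits to, is random oversampling of the minority classes with `DataFrame.sample`:

```python
import pandas as pd

# Toy imbalanced frame standing in for the real dataset
df = pd.DataFrame({'RESULT': ['WIN'] * 6 + ['LOSS'] * 3 + ['DRAW'] * 1,
                   'x': range(10)})

max_n = df['RESULT'].value_counts().max()  # size of the largest class (6 here)
balanced = pd.concat(
    group.sample(max_n, replace=True, random_state=0)  # resample each class up to max_n
    for _, group in df.groupby('RESULT')
).reset_index(drop=True)

print(sorted(balanced['RESULT'].value_counts().tolist()))  # [6, 6, 6]
```

Oversampling duplicates minority rows, so any train/test split should be done before balancing to avoid leaking duplicated rows into the test set.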